feat(scaler): add observability (metrics + tracing) to the external scaler #1634

Open
Fedosin wants to merge 1 commit into kedacore:main from Fedosin:scaler-observability
Contributor

@Fedosin Fedosin commented May 13, 2026

Add OpenTelemetry-based metrics and distributed tracing to the external
scaler component, which previously had no observability instrumentation.

Shared observability infrastructure is extracted into pkg/observability/
so both the interceptor and scaler reuse the same tracing setup, metrics
provider, and configuration types.

Metrics:

  • scaler.pinger.fetch.duration (histogram) — duration of each queue pinger fetch cycle
  • scaler.pinger.fetch.errors (counter) — total failed pinger fetch cycles
  • scaler.pinger.endpoints (gauge) — number of interceptor endpoints being polled
  • Prometheus /metrics endpoint on port 2223 (configurable via OTEL_PROM_EXPORTER_PORT)
  • Optional OTLP HTTP metrics export (via OTEL_EXPORTER_OTLP_METRICS_ENABLED)

Tracing:

  • OTEL tracing SDK with console, HTTP/protobuf, and gRPC exporters
  • otelgrpc stats handler for automatic gRPC server span instrumentation
  • W3C TraceContext + Baggage propagation
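
For readers unfamiliar with W3C TraceContext, the sketch below parses a version `00` `traceparent` header using only the standard library. In the scaler this work would presumably be done by the OTEL propagator rather than hand-written code; `parseTraceParent` and `traceParent` are hypothetical names, and the parser is simplified (for example, it does not reject uppercase hex digits).

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceParent holds the fields carried by a version-00 traceparent header.
type traceParent struct {
	TraceID string
	SpanID  string
	Sampled bool
}

// parseTraceParent parses "00-<32 hex>-<16 hex>-<2 hex>".
func parseTraceParent(header string) (traceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errors.New("traceparent: expected version 00 with four fields")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if len(traceID) != 32 || len(spanID) != 16 || len(flags) != 2 {
		return traceParent{}, errors.New("traceparent: wrong field length")
	}
	for _, field := range []string{traceID, spanID, flags} {
		if _, err := hex.DecodeString(field); err != nil {
			return traceParent{}, errors.New("traceparent: field is not hexadecimal")
		}
	}
	// All-zero trace and span IDs are defined as invalid.
	if strings.Trim(traceID, "0") == "" || strings.Trim(spanID, "0") == "" {
		return traceParent{}, errors.New("traceparent: all-zero IDs are invalid")
	}
	flagByte, _ := hex.DecodeString(flags) // already validated above
	return traceParent{
		TraceID: traceID,
		SpanID:  spanID,
		Sampled: flagByte[0]&0x01 == 0x01, // lowest bit is the sampled flag
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp, err)
}
```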

Configuration env vars (same as the interceptor):

  • OTEL_PROM_EXPORTER_ENABLED (default: true)
  • OTEL_PROM_EXPORTER_PORT (default: 2223)
  • OTEL_EXPORTER_OTLP_METRICS_ENABLED (default: false)
  • OTEL_EXPORTER_OTLP_TRACES_ENABLED (default: false)
  • OTEL_EXPORTER_OTLP_TRACES_PROTOCOL (default: console)

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO)
  • Changelog has been updated and is aligned with our changelog requirements
  • Any necessary documentation is added, such as:

Part of #965

Copilot AI review requested due to automatic review settings May 13, 2026 15:14
@Fedosin Fedosin requested a review from a team as a code owner May 13, 2026 15:14
@keda-automation keda-automation requested a review from a team May 13, 2026 15:14

snyk-io Bot commented May 13, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |

Contributor

Copilot AI left a comment

Pull request overview

This PR adds OpenTelemetry-based observability to the external scaler by introducing metric instruments/exporters and distributed tracing, plus wiring them into the scaler’s startup and queue polling logic.

Changes:

  • Added an OTEL metrics provider + instruments for queue pinger fetch duration/errors and endpoint count, with Prometheus and optional OTLP/HTTP export.
  • Added OTEL tracing SDK setup and enabled automatic gRPC server span instrumentation via otelgrpc when tracing is enabled.
  • Wired metrics/tracing configuration into scaler config and main startup, including a /metrics HTTP endpoint.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| scaler/tracing/tracing.go | Adds OTEL tracing SDK setup and exporter selection for the scaler. |
| scaler/metrics/provider.go | Introduces an OTEL MeterProvider with Prometheus and optional OTLP metric export. |
| scaler/metrics/instruments.go | Defines metric instruments and recording helpers for queue pinger metrics. |
| scaler/queue_pinger.go | Records pinger fetch metrics on each polling cycle. |
| scaler/queue_pinger_test.go | Updates pinger construction in tests for the new instruments parameter. |
| scaler/main.go | Initializes metrics/tracing, adds Prometheus /metrics server, and instruments gRPC when enabled. |
| scaler/config.go | Adds env-configurable metrics and tracing settings for the scaler. |
| go.mod | Adds otelgrpc dependency for gRPC tracing instrumentation. |
| go.sum | Updates sums for added/updated dependencies. |

Comment thread scaler/queue_pinger.go Outdated
Comment thread scaler/metrics/provider.go Outdated
@Fedosin Fedosin force-pushed the scaler-observability branch 3 times, most recently from 744e536 to a4f8dbb on May 18, 2026 15:08
Member

@linkvt linkvt left a comment

I left a few comments; we can probably reduce the size quite a bit by deduplicating code we already have in the interceptor.

The PR description should probably also not say "Fixes #..." to avoid auto-closing the issue.
We should also remember to update the Helm chart and the resources in config/ in this repo.

Comment thread scaler/main.go
  if err != nil {
      setupLog.Error(err, "Kubernetes client config not found")
-     os.Exit(1)
+     runtime.Goexit()
Member

@linkvt linkvt May 19, 2026

Is there a reason for using runtime.Goexit now? If this is intended, we should probably also add defer os.Exit(1) at the top, as in the interceptor, so that the gRPC server etc. is also stopped after runtime.Goexit has been called.

Contributor Author

Fixed — added defer os.Exit(1) at the top of main (same pattern as the interceptor). All subsequent failures use runtime.Goexit() to ensure defers run.
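
The ordering this pattern relies on can be sketched as follows. Deferred calls run last-in, first-out, so an exit deferred at the very top of a function runs after every later cleanup once `runtime.Goexit()` unwinds the goroutine. The names `failingMain` and `runInGoroutine` are hypothetical, and `exit` is a stand-in for `os.Exit` so the sketch can run to completion without terminating the process.

```go
package main

import (
	"fmt"
	"runtime"
)

// failingMain mimics a main function that hits a fatal startup error.
// exit stands in for os.Exit so the order of events can be observed.
func failingMain(exit func(code int), events *[]string) {
	defer exit(1) // deferred first, therefore runs last
	defer func() { *events = append(*events, "cleanup") }()

	*events = append(*events, "startup failed")
	runtime.Goexit() // unwinds this goroutine and runs all deferred calls
}

// runInGoroutine runs failingMain on its own goroutine, because calling
// Goexit on the main goroutine would leave the program without a way to
// finish once the stand-in exit returns.
func runInGoroutine() []string {
	var events []string
	done := make(chan struct{})
	go func() {
		defer close(done) // also runs during the Goexit unwind
		failingMain(func(code int) {
			events = append(events, fmt.Sprintf("exit(%d)", code))
		}, &events)
	}()
	<-done
	return events
}

func main() {
	fmt.Println(runInGoroutine())
}
```

With the real `os.Exit`, nothing after the deferred exit would run, which is why it has to be the first defer registered.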

Comment thread scaler/metrics/instruments.go Outdated
Comment on lines +23 to +24
AttrNamespace = "namespace"
AttrService = "service"
Member

unused vars

Contributor Author

Removed.

Comment thread scaler/tracing/tracing.go Outdated
@@ -0,0 +1,91 @@
package tracing
Member

This looks like a copy of interceptor/tracing/tracing.go; we should deduplicate this code.

Contributor Author

Deduplicated — extracted shared tracing setup into pkg/observability/tracing.go. Both interceptor and scaler now delegate to it.

@@ -0,0 +1,51 @@
package metrics
Member

This looks like a copy of interceptor/provider/metrics.go; we should deduplicate this code.

Contributor Author

Deduplicated — extracted shared meter provider factory into pkg/observability/metrics.go. Both interceptor and scaler delegate to it with their respective service names.

Comment thread scaler/queue_pinger.go Outdated
- perPod, err := fetchCountsPerPod(ctx, q.lggr, q.getEndpointsFn, q.interceptorNS, q.interceptorSvcName, q.adminPort)
+ fetchStart := time.Now()
+ result, err := fetchCountsPerPod(ctx, q.lggr, q.getEndpointsFn, q.interceptorNS, q.interceptorSvcName, q.adminPort)
+ if q.instruments != nil {
Member

We could use the same pattern as in the interceptor instruments with NewNoopInstruments() and pass that into the components? This makes the code cleaner as we avoid adding a special case for instruments being nil.

Contributor Author

Done — added NewNoopInstruments() and all tests now use it instead of nil. The nil check in fetchAndSaveCounts is removed.
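
The no-op approach discussed in this thread can be sketched without any OTEL dependency. The PR appears to use a concrete Instruments type; the `fetchRecorder` interface, `noopRecorder`, `countingRecorder`, and `pinger` below are hypothetical and exist only to show why the call site no longer needs a nil check.

```go
package main

import (
	"fmt"
	"time"
)

// fetchRecorder is a hypothetical interface used only in this sketch.
type fetchRecorder interface {
	RecordFetch(d time.Duration, endpointCount int, fetchErr error)
}

// noopRecorder satisfies fetchRecorder and discards everything.
type noopRecorder struct{}

func (noopRecorder) RecordFetch(time.Duration, int, error) {}

// newNoopRecorder plays the role of a no-op constructor for tests and for
// components that run with metrics disabled.
func newNoopRecorder() fetchRecorder { return noopRecorder{} }

// countingRecorder is a trivial real implementation for the demo.
type countingRecorder struct{ calls int }

func (c *countingRecorder) RecordFetch(time.Duration, int, error) { c.calls++ }

// pinger always holds a usable recorder, so it never checks for nil.
type pinger struct{ rec fetchRecorder }

// fetchOnce times one fetch cycle and records the outcome unconditionally.
func (p *pinger) fetchOnce(fetch func() (int, error)) error {
	start := time.Now()
	n, err := fetch()
	p.rec.RecordFetch(time.Since(start), n, err)
	return err
}

func main() {
	counting := &countingRecorder{}
	withMetrics := &pinger{rec: counting}
	withoutMetrics := &pinger{rec: newNoopRecorder()}

	fetch := func() (int, error) { return 2, nil }
	_ = withMetrics.fetchOnce(fetch)
	_ = withoutMetrics.fetchOnce(fetch)
	fmt.Println("recorded calls:", counting.calls)
}
```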

meterName = "keda-external-scaler"

// ServiceName is the OTEL service.name used for both metrics and tracing.
ServiceName = "keda-http-external-scaler"
Member

We should probably also align the service name of the interceptor to use this naming scheme

Contributor Author

Good point. The interceptor already uses keda-http-interceptor as its service name (set in interceptor/tracing/tracing.go). I think that's consistent with the scaler's keda-http-external-scaler. Should we rename the interceptor to something like keda-http-interceptor-proxy or is keda-http-interceptor fine?

Comment thread scaler/config.go
ProfilingAddr string `env:"PROFILING_BIND_ADDRESS" envDefault:""`
// StreamIntervalMS is the interval in milliseconds between stream ticks
StreamIntervalMS int `env:"KEDA_HTTP_SCALER_STREAM_INTERVAL_MS" envDefault:"200"`

Member

I guess we could also deduplicate the config to ensure it is consistent across components?

Contributor Author

Done — extracted MetricsConfig and TracingConfig into pkg/observability/config.go. Both interceptor and scaler use type aliases to it.
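
A short sketch of why a type alias (rather than a new defined type) keeps the components interchangeable with the shared config. Everything lives in one file here for brevity; `sharedMetricsConfig`, `scalerMetricsConfig`, `copiedMetricsConfig`, and `describe` are hypothetical names, and the fields are not taken from pkg/observability/config.go.

```go
package main

import "fmt"

// sharedMetricsConfig stands in for a type that would live in a shared package.
type sharedMetricsConfig struct {
	PrometheusEnabled bool
	PrometheusPort    int
}

// scalerMetricsConfig is an alias: a second name for the very same type.
type scalerMetricsConfig = sharedMetricsConfig

// copiedMetricsConfig is a distinct defined type, shown only for contrast.
type copiedMetricsConfig sharedMetricsConfig

// describe accepts the shared type, as a shared helper would.
func describe(c sharedMetricsConfig) string {
	return fmt.Sprintf("prometheus enabled=%t port=%d", c.PrometheusEnabled, c.PrometheusPort)
}

func main() {
	aliased := scalerMetricsConfig{PrometheusEnabled: true, PrometheusPort: 2223}
	fmt.Println(describe(aliased)) // no conversion needed for an alias

	copied := copiedMetricsConfig{PrometheusEnabled: true, PrometheusPort: 2223}
	fmt.Println(describe(sharedMetricsConfig(copied))) // explicit conversion required
}
```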

Comment thread scaler/config.go Outdated

type metricsConfig struct {
OtelPrometheusExporterEnabled bool `env:"OTEL_PROM_EXPORTER_ENABLED" envDefault:"true"`
OtelPrometheusExporterPort int `env:"OTEL_PROM_EXPORTER_PORT" envDefault:"2224"`
Member

Is there a reason to not use the same port we use for the interceptor metrics?

Contributor Author

Changed to 2223 (same as the interceptor).

Comment thread scaler/metrics/instruments.go Outdated

// RecordFetch records a completed pinger fetch cycle.
func (i *Instruments) RecordFetch(duration time.Duration, endpointCount int, fetchErr error) {
attrs := api.WithAttributeSet(attribute.NewSet())
Member

Can be removed as there are no attributes

Contributor Author

Removed.

@@ -0,0 +1,81 @@
package metrics
Member

We should probably also add a test like the prometheus_test.go the interceptor has right now?

Contributor Author

Added scaler/metrics/prometheus_test.go that verifies the histogram, counter, and gauge are correctly emitted.

@Fedosin Fedosin force-pushed the scaler-observability branch from a4f8dbb to f8a351b on May 19, 2026 09:32
@keda-automation keda-automation requested a review from a team May 19, 2026 09:33
…caler

Add OpenTelemetry-based metrics and distributed tracing to the external
scaler component, which previously had no observability instrumentation.

Shared observability infrastructure is extracted into pkg/observability/
so both the interceptor and scaler (and future components) reuse the
same tracing setup, metrics provider, and configuration types.

Metrics:
- scaler.pinger.fetch.duration (histogram)
- scaler.pinger.fetch.errors (counter)
- scaler.pinger.endpoints (gauge)
- Prometheus /metrics endpoint on port 2223 (configurable)
- Optional OTLP HTTP metrics export

Tracing:
- OTEL tracing SDK with console, HTTP/protobuf, and gRPC exporters
- otelgrpc stats handler for automatic gRPC server span instrumentation
- W3C TraceContext + Baggage propagation

Relates to: kedacore#965

Signed-off-by: Mikhail Fedosin <mfedosin@redhat.com>
@Fedosin Fedosin force-pushed the scaler-observability branch from f8a351b to 9b6abde on May 19, 2026 09:34
Contributor Author

Fedosin commented May 19, 2026

Thanks for the detailed review @linkvt — addressed all inline comments in the latest force-push (f8a351b):

  • deduplicated tracing + meter provider into pkg/observability/
  • deduplicated metrics/tracing config types (MetricsConfig, TracingConfig)
  • switched scaler metrics default port to 2223 (same as interceptor)
  • added defer os.Exit(1) + runtime.Goexit() pattern in scaler main
  • removed unused vars and empty attribute set
  • added NewNoopInstruments() and removed nil special-casing
  • added scaler/metrics/prometheus_test.go
  • updated PR description to use Part of #965 instead of Fixes

On your note about Helm/config resources: this repo currently has config/ manifests but no Helm chart directory. Since these env vars are optional and have defaults, runtime behavior is unchanged unless users set them. I can add explicit env var wiring to config/scaler/deployment.yaml in this PR too if you’d prefer that visibility.
